## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.878
## 3rd Qu.:6.000
## Max. :9.000
Looks like we will be looking at how quality, the only categorical variable, interacts with other (combinations of) variables, as well as at interactions between variables that seem intuitively linked, e.g. free vs total sulfur dioxide or pH vs acidity levels. For most variables, the median and mean are quite close to each other, which suggests roughly symmetric, possibly normal, distributions. Let’s take a look at a selection of them:
Just to get an overview, I’ve plotted 10 possibly interesting variables as histograms, leaving out chlorides and density. I left out these two because I think the analysis will for the most part rest on examining the relationship between quality and some combination of other variables. From the description that came with the dataset, it seemed unlikely that either of these two variables impacts taste very much, so to simplify things, I left them out. All histograms apart from residual sugar and alcohol approximate a normal distribution, most skewing slightly right.
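The mean-versus-median comparison can be made a little more concrete. The sketch below takes four variables from the `summary()` output above and expresses the gap between mean and median as a fraction of the interquartile range. The `gap` measure is just an illustrative choice of mine, not a standard test for normality.

```r
# Quartiles, medians and means copied from the summary() output above
summ <- data.frame(
  variable = c("fixed.acidity", "residual.sugar", "alcohol", "pH"),
  q1       = c(6.3, 1.7, 9.5, 3.09),
  median   = c(6.8, 5.2, 10.4, 3.18),
  mean     = c(6.855, 6.391, 10.51, 3.188),
  q3       = c(7.3, 9.9, 11.4, 3.28)
)

# Gap between mean and median, scaled by the interquartile range.
# A positive value means the mean is pulled to the right of the median.
summ$gap <- (summ$mean - summ$median) / (summ$q3 - summ$q1)

# Variables ordered from most to least right-pulled mean
summ[order(-summ$gap), c("variable", "gap")]
```

On these numbers residual sugar has the largest gap, which fits with it being one of the two histograms that do not look normal.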
Quality ranges from 3 to 9, with the vast majority of wines being around a 5 or 6, i.e. squarely in the middle of the range (from 0-10).
Alcohol distribution looks fairly normal, skewed right and with perhaps a little bimodal bump at ca. 12.5?
Taking the log10 of alcohol makes the histogram appear a little more “normal”, but at this stage does not help me understand the alcohol variable better.
Residual sugar skews right very strongly, with the highest peak at about 1.
Taking the log10 of residual sugar and using a bin width of 0.05 gives us what looks like a bimodal distribution. So there is another cluster of values in the tail. Let’s look at an even smaller bin width and see what that shows us.
Now we can see the smaller “bump” between 0.75 and 1.25. So this shows us in a bit more detail what is happening in the tail of the histogram.
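As a minimal sketch of the binning described above, here is the log10 transform with a bin width of 0.05 in base R. It only uses the first ten residual sugar values shown in the `str()` output, so it illustrates the mechanics rather than reproducing the histogram. The variable names are my own.

```r
# First ten residual sugar values, copied from the str() output above
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
log_sugar <- log10(residual_sugar)

# Break points on a regular grid of width 0.05 that covers the data
bin_width <- 0.05
lo <- floor(min(log_sugar) / bin_width)
hi <- ceiling(max(log_sugar) / bin_width)
breaks <- (lo:hi) * bin_width

# plot = FALSE returns the bin counts without drawing anything
h <- hist(log_sugar, breaks = breaks, plot = FALSE)
data.frame(bin_start = head(h$breaks, -1), count = h$counts)
```

Dropping `plot = FALSE` draws the histogram. A smaller `bin_width` gives the finer view used to look at the tail.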
Subsetting for high quality wines (8 and 9) shows they also have higher alcohol levels (lower histogram).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 11.00 12.00 11.65 12.60 14.00
This summary confirms what the plots showed us: high quality wines tend to have high(er) alcohol levels. This is shown most clearly by the higher median and by the fact that the 1st Quartile is a lot higher for the higher quality wines.
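The comparison behind these two summaries can be sketched as follows. The data frame here is a toy example with invented values (the real data is not reproduced in this write-up), and the names `wines` and `high_quality` are assumptions.

```r
# Toy data: values invented purely to illustrate the subsetting step
wines <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 11.0, 12.0, 12.6, 9.9, 12.4),
  quality = c(5L, 6L, 6L, 7L, 8L, 9L, 5L, 8L)
)

# Keep only the wines rated 8 or 9
high_quality <- subset(wines, quality >= 8)

# Five-number summary plus mean, for all wines and for the subset
summary(wines$alcohol)
summary(high_quality$alcohol)
```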
Choosing a smaller binwidth for citric acid lets us see a weird spike at 0.5. Let’s zoom in on it.
Looks less “extreme” like this, but still noticeable. Is it just a fluke?
Taking the data subset for higher quality, we see a far narrower band of citric acid levels. Does a specific range of citric acid (not too little and not too much) lead to wines tasting better? Perhaps in combination with another variable.
Fixed acidity shows a nice normal distribution. There is no spike here corresponding to the one in the citric acid variable, as the bulk of the measurements are above 3, while almost all values for citric acid are below 1 g / dm^3.
The volatile acidity distribution has a slightly longer right tail than fixed acidity. Is there a very slight bump at ca. 0.5? Let’s zoom in.
There doesn’t seem to be anything very interesting there. I also noticed that it probably doesn’t make sense to directly compare volatile to citric acidity, as there are simply far higher levels of volatile acidity compared to citric acidity.
There are 4898 observations of 13 variables. The first variable is just a counter of the samples, so it can be excluded for the most part. All the other variables are quantitative apart from “quality”, which is categorical (values from 0-10).
The main point of interest is, I would say, how quality relates to the other variables. Which chemical properties of the wines correlate in which way with quality?
Several variables seem prima facie connected, e.g. free and total sulfur dioxide as well as the acidity variables. (However, I might be wrong, especially concerning citric acid.)
No.
Residual sugar was very strongly right-skewed and alcohol had a possibly bimodal distribution. Citric acid has a big peak at ca. 0.5 g / dm^3. I subsetted the data to only include the best rated wines (quality 8 and 9). I did this to get a first impression of what properties highly rated wines have. Preliminarily I would say that higher rated wines seem to have higher levels of alcohol. The high quality wines also seem to have a smaller range of citric acid levels.
The variable of interest, the one you’d want to predict, would be quality. But there don’t seem to be many strong (direct) correlations, apart from density vs. residual sugar.
So from a quality rating of about 7 upwards, more wines have higher alcohol contents than lower ones. For wines of quality 5 and 6, the majority of samples have lower levels of alcohol (below 11 and below 10). So this bears out what we saw in the matrix: higher quality wines have a higher alcohol content.
The same data as a boxplot: this shows at a glance how the typical (median) alcohol content of wine samples per quality rating increases with higher quality ratings.
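The per-rating comparison that a boxplot shows can also be computed directly. This is a small sketch on invented toy values (not the wine data itself), using base R's `tapply()` to get the median alcohol content for each quality rating.

```r
# Toy data: values invented purely for illustration
wines <- data.frame(
  alcohol = c(8.8, 9.9, 9.5, 10.1, 11.0, 12.0, 12.4, 12.6),
  quality = c(5L, 5L, 6L, 6L, 7L, 8L, 8L, 9L)
)

# Median alcohol content within each quality rating,
# i.e. the centre line of each box in a boxplot
median_by_quality <- tapply(wines$alcohol, wines$quality, median)
median_by_quality
```

Calling `boxplot(alcohol ~ quality, data = wines)` would draw the matching plot in base graphics.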
A general trend is visible: the more alcohol a given wine contains, the lower its residual sugar, and the trend gets more pronounced as alcohol content increases. There is also one extreme outlier with a very high amount of sugar, more than double the amount of the vast bulk of samples.
Would the line of best fit change significantly if I remove that one extreme outlier?
This scatterplot and line seem to show a stronger relationship between sugar and alcohol than the last chart did. By how much did the correlation actually change?
##
## Pearson's product-moment correlation
##
## data: df$alcohol and df$residual.sugar
## t = -35.321, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4726723 -0.4280267
## sample estimates:
## cor
## -0.4506312
##
## Pearson's product-moment correlation
##
## data: subset(df, df$residual.sugar < 60)$alcohol and subset(df, df$residual.sugar < 60)$residual.sugar
## t = -36.192, df = 4895, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4812780 -0.4370779
## sample estimates:
## cor
## -0.4594624
There was barely any change in the correlation. Probably this is due to the fact I only removed one value. But the resulting scatterplot shows the previously mentioned relationship a little more clearly, as it is “zoomed in” compared to the plot showing the extreme outlier.
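The before-and-after comparison can be sketched as a small helper around `cor.test()`. The helper name `sugar_alcohol_cor` and its `max_sugar` argument are hypothetical. The data is just the first ten rows from the `str()` output, so the cut-off of 60 used above would remove nothing here, and I use 20 purely to show the mechanics.

```r
# First ten rows of two columns, copied from the str() output at the top
wines <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)

# Hypothetical helper: Pearson correlation of alcohol and residual sugar
# for the rows whose residual sugar is below max_sugar
sugar_alcohol_cor <- function(df, max_sugar = Inf) {
  kept <- subset(df, residual.sugar < max_sugar)
  test <- cor.test(kept$alcohol, kept$residual.sugar)
  list(n = nrow(kept), r = unname(test$estimate))
}

all_rows <- sugar_alcohol_cor(wines)                  # nothing removed
trimmed  <- sugar_alcohol_cor(wines, max_sugar = 20)  # drops the two 20.7 rows

# Change in the correlation coefficient caused by the trimming
trimmed$r - all_rows$r
```

With thousands of rows and a single removed point, this difference would be expected to be small, which matches the output above.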
Comparing residual sugar and quality directly shows there is hardly any relationship between the two variables. I do find this surprising, as one would assume that the sweetness of a wine would affect the perceived quality in some way (negatively or positively). We can see though, thinking back to the histogram of residual sugar, that most samples have very low levels of residual sugar, i.e. close to 1 g / dm^3.
pH and fixed acidity: why do they only have a medium correlation? pH tells you how basic or acidic a substance or liquid is, so I would have expected a very high correlation, but it is only ca. 0.4 in magnitude. Why? Perhaps because the pH values of the samples are actually very close together?
Showing the pH scale from zero to 14 on the y-axis shows the narrow band of pH values relative to fixed acidity… wait. How is that even possible? It means that the fixed acidity measures something different from pH!
##
## Pearson's product-moment correlation
##
## data: no_acidity_outliers$pH and no_acidity_outliers$fixed.acidity
## t = -32.81, df = 4886, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.4476053 -0.4016516
## sample estimates:
## cor
## -0.4249022
Removing the acidity outliers actually reduces the correlation ever so slightly.
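One possible contributing factor, offered as general chemistry rather than something derived from this dataset: pH is defined on a logarithmic scale, pH = -log10 of the hydrogen ion concentration, whereas fixed acidity is reported as an amount of acid. A narrow-looking pH band can therefore hide a wide spread in concentration. The sketch below uses the minimum and maximum pH from the `summary()` output. The helper name `h_conc` is my own.

```r
# Minimum and maximum pH, copied from the summary() output above
ph_min <- 2.72
ph_max <- 3.82

# Hydrogen ion concentration (mol/L) implied by a pH value,
# from the definition pH = -log10([H+])
h_conc <- function(ph) 10^(-ph)

# How many times more concentrated the most acidic sample is
# compared with the least acidic one
fold_range <- h_conc(ph_min) / h_conc(ph_max)
fold_range
```

A difference of 1.1 pH units corresponds to roughly a 12.6-fold difference in concentration, so the pH values are less "close together" than the linear axis suggests. A linear (Pearson) correlation between a log-scale quantity and a linear one would also not be expected to be perfect.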
So density is within a pretty narrow range, except for that one extreme outlier. It is at the same quality rating (six) as the outlier for residual sugar. As sugar and density have the highest correlation, this is to be expected; we will see this clearly in the next plot.
We can also see that higher quality wines tend to have a lower density, i.e. the correlation is negative.
This is a plot of the strongest direct relationship in the data. We can clearly see the diagonal shape the data points create, indicating a strong positive correlation.
Surprisingly, to me at least, there are no very strong direct relationships between quality and the quantitative variables. The variable with the strongest relationship to quality is alcohol, followed by density. Alcohol and density are clearly related (cor 0.4), due I assume to the chemical properties of liquids containing alcohol. I would then have expected the correlation between density and quality to be stronger.
As density correlates highly with residual sugar I will be looking at how density, residual sugar and alcohol relate to quality.
Total sulfur dioxide and free sulfur dioxide correlate strongly, unsurprisingly. The fact that pH and acidity don’t have a stronger (negative) relationship surprises me a little, as pH is just a measure of how basic or acidic a substance is.
The highest correlation is density to residual sugar at ca. 0.8.
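Picking the strongest pair out of a correlation matrix can be sketched like this. The helper `strongest_pair` is hypothetical. The demo data is only the first five rows of four columns from the `str()` output (density shows just five values there), so which pair it picks says nothing about the full dataset.

```r
# First five rows of four columns, copied from the str() output at the top
wines <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Hypothetical helper: the pair of columns with the largest absolute
# Pearson correlation, together with that correlation
strongest_pair <- function(df) {
  m <- cor(df)
  m[upper.tri(m, diag = TRUE)] <- NA           # keep each pair once, drop the diagonal
  pos <- arrayInd(which.max(abs(m)), dim(m))   # row/column of the largest |r|
  list(vars = c(rownames(m)[pos[1]], colnames(m)[pos[2]]),
       r    = m[pos[1], pos[2]])
}

strongest_pair(wines)
```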
This plot shows that the higher alcohol, lower sugar wines tend to be considered higher quality. However, there are some very high quality wines, between 11 and 13% alcohol, that have a fairly high sugar content (between 5 and 15 g / dm^3), whereas most of the good or OK wines with high alcohol content have little sugar, below 5 g / dm^3 I’d say. Let’s try swapping quality and alcohol around to see if that helps visually:
OK, so here we can see better that the wines with lower alcohol content (redder) tend to have more residual sugar. This can be seen most clearly by looking at higher quality wines (>=7): they tend to have higher alcohol contents, but those with high residual sugar tend to have lower levels of alcohol. As quality increases, residual sugar goes down, but as there are fewer samples of high quality wines, it’s harder to say how significant this pattern is. Remembering the histogram of residual sugar again, most values are around 1 g / dm^3. We can see that in the columns too: most values (overplotted) are at the bottom, i.e. around 1.
Here we can see quite nicely the inverse relationship between density and alcohol, and to a lesser extent quality: the orange and yellow data points (quality 7 and 8) are clustered to the bottom left of the plot.
Here I have added an additional dimension (residual sugar) to the previous chart: larger plot points indicate higher levels of residual sugar. I am trying to visualise how residual sugar interacts with the other plotted variables. I hope I am not reading too much into this visualisation, but I would say that as density decreases and alcohol levels increase, the plot points also get smaller (except for the outliers). I would also provisionally say that there seems to be a sweet spot for quality at ca. 12.5% alcohol, with generally lower residual sugar.
I think I was trying to show that alcohol, density and sugar all work together in explaining quality. But I am not so sure this is clearly borne out by what I have explored so far. If I were to build a model, I would be interested to see if residual sugar (in addition to density and alcohol) helps predict wine quality or not.
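The modelling question raised here could be approached by comparing two nested linear models, one with and one without residual sugar. The sketch below runs on simulated data with made-up coefficients, so its output says nothing about the real wines. Treating quality as a numeric response is also a simplifying assumption of mine.

```r
# Simulated stand-in data: every number below is invented for illustration
set.seed(1)
n <- 200
sim <- data.frame(
  alcohol        = runif(n, 8, 14),
  residual.sugar = runif(n, 1, 20)
)
sim$density <- 0.99 + 0.0004 * sim$residual.sugar -
  0.001 * (sim$alcohol - 10) + rnorm(n, sd = 0.0005)
sim$quality <- 3 + 0.3 * sim$alcohol + rnorm(n, sd = 0.7)

# Nested models: the second adds residual sugar to the first
base_model <- lm(quality ~ alcohol + density, data = sim)
full_model <- lm(quality ~ alcohol + density + residual.sugar, data = sim)

# F-test for whether the extra term reduces the residual sum of squares
# by more than chance would explain
comparison <- anova(base_model, full_model)
comparison
```

On the real data, a small p-value in the second row would suggest that residual sugar adds predictive value beyond density and alcohol. Because sugar and density are strongly correlated, the individual coefficients would need to be read with care.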
Due to the clear relationships between sugar and density and density and alcohol, I was assuming that bringing these variables together would show far more clearly or strikingly the relationship between all 3 of these variables and ultimately, quality.
On a logarithmic scale, we can see a slightly bimodal histogram. This shows that the values in the tail are mainly just below or just above 1, giving a bit more detail about the distribution of the values in the tail of this histogram.
The plot points of these two variables, residual sugar and density, line up very nicely, approximating a diagonal line from bottom left to top right. As there are far more samples around 1 g / dm^3 for residual sugar, the data “thins out” from left to right. A few extreme outliers are also visible, but the correlation of high residual sugar with high density holds for them too.
This scatterplot shows the relationship of three quantitative variables to the single qualitative variable, i.e. quality. As density decreases and alcohol increases, quality tends to increase too. The relationship between sugar and quality is quite weak, but residual sugar is clearly linked to density and, to a lesser extent, to alcohol as well.
Considering that these samples are all of a “natural” product, wine, I expected the distribution of most of the measured variables to be normal. Alcohol and residual sugar did not fit this picture very well, so I thought they might be interesting variables to explore. I also intuitively assumed that the sweetness and alcohol content of wine would be reflected in its perceived quality. I was surprised that there were so few strong correlations; I expected the different chemical properties of the wines to be more closely linked, for some reason. The clearest one here was density and residual sugar. And since density and alcohol are correlated, I somehow also assumed that in combination they would show a clearer picture. I found this tricky to show with these scatterplots. As I mentioned in the analysis of the multivariate plots, I believe modelling these relationships might improve our understanding of them.